In this article, we explore the fundamentals of using Amazon EMR for modern data management, focusing on data governance, data mesh implementation, and efficient data discovery. A significant hurdle in managing large-scale data today is ensuring effective data sharing and access control across EMR clusters. Many organizations maintain multiple Hive data warehouses within their EMR clusters, generating metadata that becomes cumbersome to manage. To address this, businesses can implement a data mesh with AWS Lake Formation that connects diverse EMR clusters. AWS Glue Data Catalog federation with external Hive metastores lets users enforce data governance on metadata spread across these clusters and analyze the data with AWS analytics services such as Amazon Athena, Amazon Redshift Spectrum, AWS Glue ETL jobs, EMR notebooks, and EMR Serverless. For an in-depth look at managing an Apache Hive metastore with Lake Formation permissions, see the related AWS blog post on that topic.
This article outlines a strategy for deploying a data mesh that encompasses multiple Hive data warehouses across EMR clusters. This approach empowers organizations to leverage the scalability and flexibility of EMR while maintaining oversight and integrity of their data assets within the data mesh.
Use Cases for Hive Metastore Federation with Amazon EMR
Hive metastore federation for Amazon EMR is relevant for various scenarios, including:
- Governance of Amazon EMR-based Data Lakes: Producers generate data within their AWS accounts using an Amazon EMR-based data lake backed by EMRFS on Amazon S3 and HBase. These data lakes require governed access without moving the data into consumer accounts, which significantly reduces storage costs because the data stays on Amazon S3.
- Centralized Catalog for Published Data: Multiple producers release data governed by their respective entities. A centralized catalog is essential for consumer access, allowing producers to publish their data assets.
- Consumer Personas: Consumers include data analysts querying the data lake, data scientists preparing data for machine learning models and conducting exploratory analyses, and downstream systems executing batch jobs on the data within the data lake.
- Cross-Producer Data Access: Consumers may require access to data from multiple producers within the same catalog environment.
- Data Access Entitlements: Implementing restrictions at the database, table, and column levels to ensure appropriate data access control.
Solution Overview
The diagram below illustrates how data from producers with their own Hive metastores (left) can be made accessible to consumers (right) through Lake Formation permissions enforced in a central governance account.
Producers and consumers reflect logical concepts of data production and consumption within a catalog. An entity may function as both a producer of data assets and a consumer of them. The onboarding of producers is facilitated through metadata sharing, while consumer onboarding is based on permissions granted to access this metadata.
The solution involves several steps across producer, catalog, and consumer accounts:
- Deploy AWS CloudFormation templates to set up the producer, central governance (catalog), and consumer accounts.
- Validate access to cataloged Amazon S3 data using EMR Serverless from the consumer account.
- Execute Athena queries to test access in the consumer account.
- Confirm access through SageMaker Studio in the consumer account.
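The Athena validation step above can be sketched with boto3. The snippet below only assembles the request arguments; the federated database, query, and results bucket names are hypothetical placeholders, and the actual API call is left as a comment so the sketch runs anywhere:

```python
def build_athena_query(database: str, query: str, output_location: str) -> dict:
    """Assemble keyword arguments for Athena's StartQueryExecution API.

    From the consumer account, pass the result to
    boto3.client("athena").start_query_execution(**params).
    """
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical federated database and results bucket:
params = build_athena_query(
    database="federated_hive_db",
    query="SELECT * FROM sample_table LIMIT 10",
    output_location="s3://consumer-athena-results/",
)
# client = boto3.client("athena")
# client.start_query_execution(**params)
```

Lake Formation permissions granted to the caller's IAM role determine which federated databases and tables this query can see.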
Producer
Producers generate data within their AWS accounts utilizing an Amazon EMR-based data lake and Amazon S3. Multiple producers subsequently publish their data into a centralized catalog account. Each producer account, along with the central catalog account, requires either VPC peering or AWS Transit Gateway to enable AWS Glue Data Catalog federation with the Hive metastore.
For each producer, an AWS Glue Hive metastore connector Lambda function is deployed in the catalog account, allowing the Data Catalog to access Hive metastore information at runtime from the producer. The data lake locations (the S3 bucket locations of the producers) are registered in the catalog account.
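Registering a producer's S3 location with Lake Formation uses the RegisterResource API. The sketch below builds the request arguments; the bucket ARN is a hypothetical example, and the boto3 call itself is commented out:

```python
from typing import Optional


def build_register_resource(s3_resource_arn: str,
                            role_arn: Optional[str] = None) -> dict:
    """Assemble arguments for Lake Formation's RegisterResource API.

    With no role ARN, Lake Formation uses its service-linked role. Pass the
    result to boto3.client("lakeformation").register_resource(**params)
    in the catalog account.
    """
    params: dict = {"ResourceArn": s3_resource_arn}
    if role_arn is None:
        params["UseServiceLinkedRole"] = True
    else:
        params["RoleArn"] = role_arn
    return params


# Hypothetical producer data lake bucket:
params = build_register_resource("arn:aws:s3:::producer-data-lake-bucket")
# boto3.client("lakeformation").register_resource(**params)
```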
Central Catalog
A catalog provides governed and secure data access to consumers. Federated databases are created within the catalog account’s Data Catalog using the Hive connection, managed by the catalog Lake Formation admin (LF-Admin). The LF-Admin shares these federated databases with the consumer LF-Admin of the external consumer account.
Data access entitlements are controlled through access rules applied at various levels, such as database and table.
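Creating a federated database in the catalog account's Data Catalog is done through the Glue CreateDatabase API, with a `FederatedDatabase` block that points at the Hive metastore connection. A minimal sketch, where the connection and database names are hypothetical placeholders:

```python
def build_federated_database(name: str, connection_name: str,
                             identifier: str) -> dict:
    """Assemble the DatabaseInput for Glue's CreateDatabase API so the new
    Data Catalog database federates to a remote Hive metastore.

    Pass as boto3.client("glue").create_database(DatabaseInput=db_input).
    """
    return {
        "Name": name,
        "FederatedDatabase": {
            # Database name inside the producer's Hive metastore:
            "Identifier": identifier,
            # Glue connection backed by the Hive metastore connector Lambda:
            "ConnectionName": connection_name,
        },
    }


# Hypothetical names for a producer's sales database:
db_input = build_federated_database(
    name="producer_sales_federated",
    connection_name="hms-connection-producer-a",
    identifier="sales_db",
)
# boto3.client("glue").create_database(DatabaseInput=db_input)
```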
Consumer
The consumer LF-Admin grants the necessary (possibly restricted) permissions to roles in its account, including data analysts, data scientists, and the AWS Identity and Access Management (IAM) roles used by downstream processing engines. Data access entitlements are managed through access controls at various levels, such as database and table, based on specific requirements.
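Such grants map to Lake Formation's GrantPermissions API. The sketch below assembles a table-level grant, optionally restricted to named columns; the role ARN, database, table, and column names are hypothetical, and the boto3 call is left as a comment:

```python
def build_table_grant(principal_arn: str, database: str, table: str,
                      columns=None, permissions=("SELECT",)) -> dict:
    """Assemble arguments for Lake Formation's GrantPermissions API.

    When columns are given, the grant is scoped to just those columns. Pass
    the result to boto3.client("lakeformation").grant_permissions(**params).
    """
    if columns:
        resource = {"TableWithColumns": {
            "DatabaseName": database,
            "Name": table,
            "ColumnNames": list(columns),
        }}
    else:
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": list(permissions),
    }


# Hypothetical analyst role, restricted to two columns of one table:
params = build_table_grant(
    "arn:aws:iam::111122223333:role/DataAnalystRole",
    database="producer_sales_federated",
    table="sample_table",
    columns=["order_id", "order_date"],
)
# boto3.client("lakeformation").grant_permissions(**params)
```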
Prerequisites
Implementing this solution requires three AWS accounts with admin access, ideally dedicated to testing. The producer account hosts the EMR cluster and S3 buckets, the catalog account hosts Lake Formation and AWS Glue, and the consumer account hosts EMR Serverless, Athena, and SageMaker notebooks.
Setting Up the Producer Account
Before launching the CloudFormation stack, gather the following information from the catalog account:
- Catalog AWS account ID (12-digit account ID)
- Catalog VPC ID (e.g., vpc-xxxxxxxx)
- VPC CIDR (catalog account VPC CIDR; it should not overlap with 10.0.0.0/16)
The VPC CIDRs of the producer and catalog accounts must not overlap, as required by VPC peering and AWS Transit Gateway. Use the CIDR of the catalog account VPC in which the AWS Glue metastore connector Lambda function will eventually be deployed.
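The non-overlap requirement can be checked before launching the stack with the standard library's `ipaddress` module:

```python
import ipaddress


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two IPv4 CIDR blocks share any addresses."""
    a = ipaddress.ip_network(cidr_a)
    b = ipaddress.ip_network(cidr_b)
    return a.overlaps(b)


# The producer VPC uses 10.0.0.0/16, so the catalog VPC CIDR must not overlap it:
assert cidrs_overlap("10.0.0.0/16", "10.0.5.0/24")        # overlaps: invalid choice
assert not cidrs_overlap("10.0.0.0/16", "172.31.0.0/16")  # disjoint: valid choice
```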
The CloudFormation stack for the producer creates the following resources:
- An S3 bucket that hosts the data referenced by the EMR cluster's Hive metastore.
- A VPC with CIDR 10.0.0.0/16 (ensure no existing VPC uses this CIDR).
- A VPC peering connection between the producer and catalog accounts.
- Amazon Elastic Compute Cloud (Amazon EC2) security groups for the EMR cluster.
- The IAM roles required for the solution.
- An EMR 6.10 cluster launched with Hive installed.
- Sample data downloaded to the S3 bucket.
- A database and external tables in the cluster's Hive metastore that point to the downloaded sample data.
Complete the Following Steps:
- Launch the template PRODUCER.yml using an IAM role with administrator privileges.
- Collect the following values from the CloudFormation stack’s Outputs tab:
- VpcPeeringConnectionId (e.g., pcx-xxxxxxxxx)
- DestinationCidrBlock (10.0.0.0/16)
- S3ProducerDataLakeBucketName
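The outputs above can also be collected programmatically, assuming the standard response shape of CloudFormation's DescribeStacks API; the stack values shown here are illustrative only:

```python
def get_stack_output(stack: dict, key: str) -> str:
    """Return the value of a named output from one stack description, as
    found under Stacks[i] in a DescribeStacks response."""
    for output in stack.get("Outputs", []):
        if output["OutputKey"] == key:
            return output["OutputValue"]
    raise KeyError(f"Output {key!r} not found in stack outputs")


# Shape matches boto3.client("cloudformation").describe_stacks()["Stacks"][0];
# the values below are illustrative placeholders.
stack = {"Outputs": [
    {"OutputKey": "VpcPeeringConnectionId", "OutputValue": "pcx-0123456789abcdef0"},
    {"OutputKey": "DestinationCidrBlock", "OutputValue": "10.0.0.0/16"},
]}
peering_id = get_stack_output(stack, "VpcPeeringConnectionId")
```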